Learning outcomes

  • Obtaining differential expression statistics for different contrasts
  • Using annotation databases to map between gene identifers
  • Construction and interpretation of common visualisations for RNA-seq
    • scatter plots
    • volcano plots
    • MA-plots
    • heatmaps

We can now have a list of genes ordered according to their evidence for being differentially-expressed. You should have saved a de_treated.rds object in the previous session.

de_treated <- readRDS("Robjects/de_treated.rds")
dds <- readRDS("Robjects/dds.rds")

If you don’t have it, you can download directly into RStudio using the code below:-

dir.create("Robjects/")
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/de_treated.rds",destfile = "Robjects/de_treated.rds")
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/dds.rds",destfile = "Robjects/dds.rds")


de_treated <- readRDS("Robjects/de_treated.rds")
dds <- readRDS("Robjects/dds.rds")
library(tidyverse)
library(DESeq2)

An overview of the results is to use the plotMA function. Each point on this plot represents and individual gene with the x- and y-axes being the overall expression level and magnitude of difference respectively. Significant genes are automatically highlighted, although at this stage we don’t have a convenient way of labeling the genes. The fanning effect at low expression levels is often seen due to high relative fold-change at low expression levels.

plotMA(de_treated)

It is also instructive to perform a “sanity” check and plot the sample-level counts for genes with high significance. This could highlight any other technical factors that we are not currently taking into account. The plot is not particularly attractive, but is a good quick diagnostic. If you pick genes with a large fold-change (and/or low p-value) you can verify the “direction” of the contrast and which sample group was used as a baseline.

results_treated <- results(de_treated, tidy=TRUE) %>% arrange(padj)
## this gene has a positive fold-change
plotCounts(dds,"ENSG00000158258",intgroup = "Treated")

## this gene has a negative fold-change
plotCounts(dds,"ENSG00000136999",intgroup = "Treated")

From the plots we can confirm that un-treated samples (annotated with N in the Treatment group) was used as a baseline; in other words a positive fold-change means a higher expression level in Treated samples. We will see shortly how to change the direction of the contrast.

If your study involves knocking-out a particular gene, or you have some positive controls that are known in advance, it would be a good idea to visualise their expression level with plotCounts.

Exercise

  • Re-run the DESeq function to find differentially-expressed genes between the TGF treated samples and CTRL. You will need to change the design that is used.
  • Write a csv file that contains results for the genes that have a p-value less than 0.05 and a log2 fold change more than 1, or less than -1.
    • HINT: So that we don’t overwrite our results so far, it may be convenient to create a new DESeqDataSet object for the new differential expression analysis. Check the colData to see which analyses can be made
  • Use the plotCounts function to visually-inspect the most statistically-significant gene identified
dds_condition<- dds
colData(dds_condition)
## You will need to change this line to choose the correct comparison
design(dds_condition) <- ~...

Changing the direction of the contrast

In this initial analysis DESeq2 has automatically decided which member of our sample groups to use as our baseline (CTR in this case) so that the log2 fold changes are reported with a positive value meaning higher expression in Treated. If we want to change this behaviour we can change the contrast argument in the results function

## This should give the same as the table above
results(de_treated, contrast=c("Treated","Y","N"))
## Changing the direction of the contrast
results(de_treated, contrast=c("Treated","N","Y"))

Analyses involving more than two groups

If we change to performing differential expression analysis on the condition variable then there are various contrasts that can be made; IR vs CTR, TGF vs CTR etc. When the results function is run, the table that is displayed is for the contrast TGF vs CTR. The resultsNames function can tell us which other contrasts we can access.

dds_condition<- dds
dds_condition$condition <- as.factor(stringr::str_to_upper(dds_condition$condition))
design(dds_condition) <- ~condition
de_condition <- DESeq(dds_condition)
resultsNames(de_condition)
[1] "Intercept"            "condition_IR_vs_CTR" 
[3] "condition_TGF_vs_CTR"

For a factor with more than 2 groups, the contrast argument can be used to change the output. Here, we compare the IR to CTR -treated groups.

results(de_condition, contrast = c("condition","IR","CTR"),tidy=TRUE)  %>%
arrange(padj) 

If we wanted to change the analysis to IR versus TGF we don’t have to re-run the analysis. The results function can be used to extract a different contrast.

results(de_condition, contrast = c("condition","IR","TGF"),tidy=TRUE) %>%
arrange(padj) 

Analysing a subset of the data

The tximport and DESeq2 packages make it quite convenient to import a set of RNA-seq counts into R. However, we don’t neccesarily have to use all the samples in our differential expression analysis.

For example, we might decide, based on the sample QC, that one of our samples should be removed as an outlier.

The dds object can be treated with the usual subsetting conventions as a count matrix (even though it is a more complex data type). Specifically we can define a set of column indices to subset the samples.

e.g. this will return a DESeqDataset using the first four samples from dds.

dds[,1:4]
class: DESeqDataSet 
dim: 57914 4 
metadata(1): version
assays(2): counts avgTxLength
rownames(57914): ENSG00000000003 ENSG00000000005 ...
  ENSG00000284747 ENSG00000284748
rowData names(0):
colnames(4): 1_CTR_BC_2 2_TGF_BC_4 3_IR_BC_5
  4_CTR_BC_6
colData names(5): Run condition Name Replicate
  Treated
samples_to_keep <- c("1_CTR_BC_2","2_TGF_BC_4","3_IR_BC_5",
                     "4_CTR_BC_6","5_TGF_BC_7", "6_IR_BC_12",
                     "7_CTR_BC_13","8_TGF_BC_14","9_IR_BC_15")
dds_final <- dds[,dds$Run %in% samples_to_keep]
## Proceed as usual
samples_to_remove <- "7_CTR_BC_13"
dds_final <- dds[,!dds$Run %in% samples_to_remove]

Or based on some criteria such as the number of reads

number_of_reads <- colSums(counts(dds)) / 1e6
dds_final <- dds[,number_of_reads > 35]

Finally, we might decide to analyse a specific biological group.

dds_treated <- dds[,dds$Treated == "Y"]

More-complex designs

The examples we have used so far have performed a differential expression analysis using a named column in the colData object. The DESeq2 package is capable of performing more complex analyses that can take multiple factors into consideration at the same time; so-called “multi-factor designs”

The use of such a design could be motivated by discovering sources of technical variation in our data that might obscure the biological differences we would like to compare. e.g.

In the example image above the main source of variation is the batch in which the samples were sequenced. A multi-factor analysis to compare the various conditions, but “correct” for differences in batch, would be as follows.

### Don't run this. It's just a code example
design(MY_DATA) <- ~ batch + condition

Likewise, if we have different treatments applied to difference cell-lines, but the main source of variation is the cell line (as in Weekly Exercise 4), the following could be used.

### Don't run this. It's just a code example
design(MY_DATA) <- ~cell_line + treatment

Adding annotation to the DESeq2 results

We would love to share this list with our collaborators, or search for our favourite gene in the results. However, the results are not very useful in there current form as each row is named according to an Ensembl identifier. Whilst gene symbols are problematic and can change over time, they are the names that are most recognisable and make the results easier to navigate.

There are a number of ways to add annotation, but we will demonstrate how to do this using the org.Hs.eg.db package. This package is one of several organism-level packages in Bioconductor that are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use biomaRt, an interface to the BioMart resource. BioMart is much more comprehensive, but the organism packages fit better into the Bioconductor workflow.

### Only execute when you need to install the package
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
# For Human
BiocManager::install("org.Hs.eg.db")

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make offline queries.

library(org.Hs.eg.db)

The following error may be seen loading this library on some versions of RStudio.

Show in New WindowClear OutputExpand/Collapse Output
Error: package or namespace load failed for ‘org.Hs.eg.db’: .onLoad failed in loadNamespace() for 'org.Hs.eg.db', details: call: l$contains error: $ operator is invalid for atomic vectors

A solution is presented in this forum post

options(connectionObserver = NULL)
library(org.Hs.eg.db)

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"     
 [4] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
 [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
[10] "GENENAME"     "GENETYPE"     "GO"          
[13] "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
[19] "PATH"         "PFAM"         "PMID"        
[22] "PROSITE"      "REFSEQ"       "SYMBOL"      
[25] "UCSCKG"       "UNIPROT"     

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the keytypes function.

keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"     
 [4] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
 [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
[10] "GENENAME"     "GENETYPE"     "GO"          
[13] "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
[19] "PATH"         "PFAM"         "PMID"        
[22] "PROSITE"      "REFSEQ"       "SYMBOL"      
[25] "UCSCKG"       "UNIPROT"     

We should see ENSEMBL, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with keys

keys(org.Hs.eg.db, keytype="ENSEMBL")[1:10]
 [1] "ENSG00000121410" "ENSG00000175899" "ENSG00000256069"
 [4] "ENSG00000171428" "ENSG00000156006" "ENSG00000196136"
 [7] "ENSG00000114771" "ENSG00000127837" "ENSG00000129673"
[10] "ENSG00000090861"

For the top gene in our analysis the call to the function would be:-

select(org.Hs.eg.db, keys="ENSG00000158258",
       keytype = "ENSEMBL",columns=c("SYMBOL","GENENAME")
)

Unfortunately, the authors of dplyr and AnnotationDbi have both decided to use the name select in their packages. To avoid confusion and errors, the following code is sometimes used to explictly tell RStudio to use the select function from AnnotationDbi.

AnnotationDbi::select(org.Hs.eg.db, keys="ENSG00000158258",keytype = "ENSEMBL",columns=c("SYMBOL","GENENAME"))

To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let’s build up our annotation information into a new data frame using the select function. The keys argument to select is a vector of valid identifiers. The names of all Ensembl IDs used in the analysis can be found as the rownames of our DESeq2 object.

## Get the names of all genes used in the analysis
all_ids <- rownames(dds)
anno <- AnnotationDbi::select(org.Hs.eg.db,keys=all_ids,
              columns=c("SYMBOL","GENENAME"),
              keytype="ENSEMBL")
# Have a look at the annotation
head(anno)

However, we have a problem that the resulting data frame has more rows than our results table. This is due to the one-to-many relationships that often occur when mapping between various identifiers.

dim(anno)
[1] 58140     3
dim(results_treated)
[1] 57914     7

Such duplicated entries can be identified using the duplicated function. This will be either TRUE or FALSE if a particular entry is repeated again in a column. Since we are interested in the entries that are not repeated we an use the R syntax ! which will reverse a set of logical values.

df <- data.frame(x = 1:10, y = c(LETTERS[1:6],LETTERS[1:4]))
df
filter(df, duplicated(y))
filter(df, !duplicated(y))

Fortunately, there are not too many duplicated entries so hopefully we won’t lose too much information if we discard the entries that are duplicates The first occurrence of the duplicated ID will still be included in the table.

anno <- AnnotationDbi::select(org.Hs.eg.db,keys=all_ids,
            columns=c("ENSEMBL","SYMBOL","GENENAME","ENTREZID"),
            keytype="ENSEMBL") %>% 
filter(!duplicated(ENSEMBL))
dim(anno)
[1] 57914     4

We can now join the annotation information to the results data frame. This can be done because both data frame contains the same set of identifiers. However, the column names are different so need to tell R which columns contain the same data (row and ENSEMBL respectively).

results_annotated <- left_join(results_treated, anno,by=c("row"="ENSEMBL")) 
head(results_annotated)

We can save the results table using the write.csv function, which writes the results out to a csv file that you can open in excel.

dir.create("de_analysis",showWarnings = FALSE)
write.csv(results_annotated,file="out_data/DESeq_treatment_Y_vs_N_annotated.csv",row.names=FALSE)

Exercise

  • Join the annotation table to your results from the DESeq analysis of TGF vs CTR. Save the resulting data frame as a csv file. e.g. (out_data/DESeq_condition_TGF_vs_CTR_annotated.csv)
  • The publication gives examples of COL1A1, COL1A2 and COL3A1 as genes that are up-regulated in TGF-treated samples vs controls (Figure 6C). Use your data to verify this by
      1. plotting the counts for these genes
      1. extracting their p-values

Visualisations incorporating gene names

The volcano plot

Similar to the MA-plot we saw earlier, the volcano plot is often used to display the results of a differential expression analysis. The x-axis shows the log\(_2\) fold-change and the y-axis is a measure of significance - typically a -log\(_{10}\) transformed version of the p-value.

We would not be able to visualise the p-values without any transformation (they would be squashed on the bottom of the y-axis). By using a log\(_{10}\) transformation we get quantities that are easier to visualise. Applying a minus sign to the result places more significant p-values at the top of the plot.

p <- c(0.05,0.01,0.0001)
log10(p)
[1] -1.30103 -2.00000 -4.00000
-log10(p)
[1] 1.30103 2.00000 4.00000

All the data we need for the plot is included in our annotated results table.

## We manually set the x-limit in this case
ggplot(results_annotated, aes(x = log2FoldChange, y = -log10(padj))) + geom_point() + xlim(-5,5)

To make the plot more useful we can label individual points. The general way of adding of adding labels is by using geom_text in conjunction with the label aesthetic. However, it is not very sensible to blindly add the geom_text function. We use the SYMBOL column from the results table as these are recognisable gene names.

ggplot(results_annotated,aes(x = log2FoldChange, y = -log10(padj),label=SYMBOL)) + geom_point() + geom_text()

The problem is that ggplot tries to label every gene that was tested in the analysis; which is clearly not very sensible.

Let’s say that we only want to label the top 20 most significant genes. The label we would want to give these genes is the gene name (SYMBOL), whereas the reminder of the genes should have no label. We can assign a set of labels based on where the genes are ranked in the gene list by using an ifelse statement in R.

With ifelse we can assign labels based on a logical condition. If the condition is TRUE then one label is assigned; if FALSE an alternative is assigned. In our case we want the label to be the gene symbol if the gene ranks less than 20 in. If not, the label is the empty string.

mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, ""))

We can now re-run the plot but using the newly created labels

mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, "")) %>% 
## notice how we set the label argument to be Label rather than SYMBOL
ggplot(aes(x = log2FoldChange, y = -log10(padj),label=Label)) + geom_point() + geom_text(col="blue")

A nicer way of displaying the names is provided by the ggrepel package

if(!require(ggrepel)) install.packages("ggrepel")
library(ggrepel)
mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, "")) %>% 
ggplot(aes(x = log2FoldChange, y = -log10(padj),label=Label)) + geom_point() + geom_text_repel(col="blue")

Making a heatmap from the differential expression results

A heatmap is a very popular method for visualising gene expression dataset. It shows the expression levels of a set of genes, and how those genes separate the samples in the dataset into groups. Like the PCA plot we used previously, it is an unsupervised method whereby known biological groups are added after the plot has been created.

However, we don’t use the whole expression matrix to construct the heatmap; attempting to do this will quickly cause R to crash. Moreoever, the plot wouldn’t be particularly informative as most genes are not expected to change between biological conditions, or won’t even be expressed.

We often use heatmaps to show a very small set of genes. The most statistically-significant genes are a popular choice as they should discriminate well between biological groups.

The first step is to identify the Ensembl IDs of the most significant genes

de_ids <- results_annotated %>% 
  arrange(padj) %>% 
  dplyr::slice(1:10) %>% 
  pull("row")
de_ids
 [1] "ENSG00000158258" "ENSG00000166396" "ENSG00000136999"
 [4] "ENSG00000115594" "ENSG00000019991" "ENSG00000070193"
 [7] "ENSG00000168398" "ENSG00000149256" "ENSG00000119508"
[10] "ENSG00000136960"

The heatmap tool we are going to use requires a count matrix as input, and our count matrices have Ensembl IDs as row identifiers. As with the boxplots and PCA plots from before this visualisation works best for transformed counts.

library(pheatmap)
vsd <- vst(dds)
heatmap_data <- assay(vsd)[de_ids,]
pheatmap(heatmap_data)

A dendrogram is used to show the relationship between samples; whereby samples that have a short “branch” joining them being similar to each other. However, samples being next to each other on the plot does not imply that they are closely related.

The code to generate the heatmap can be modified to include useful sample annotations. The meta data for this dataset is stored in dds but must undergone a bit of modification before we can include it.

## the meta data must be a data frame
samp_anno <- data.frame(colData(dds))
## the rownames must match the columns of the data being plotted
rownames(samp_anno) <- dds$Run
pheatmap(heatmap_data,
         annotation_col = samp_anno)

We can do some tidying-up as not all the columns in our annotation data frame need to be displayed

samp_anno <- dplyr::select(samp_anno, Treated, Replicate) %>% 
  mutate(Replicate = as.factor(Replicate))
pheatmap(heatmap_data,
         annotation_col = samp_anno)

Exercise:

  • Make another heatmap of the dataset, but this time using the 30 genes with the largest absolute fold-change
  • The cells in the heatmap are often scaled at a gene-level according to the change from mean rather than absolute value. Look at the help for pheatmap to see how to change the heatmap
  • It would be more informative to have the rows labeled according to the SYMBOL rather than Ensembl ID. Look at the help for pheatmap; which argument do you need to change? Change the heatmap so that SYMBOLs are shown for each row.

APPENDIX: Full DESeq workflow

The median of ratios normalisation method is employed in DESeq2 to account for sequencing depth and RNA composition. Let’s go through a short worked example (courtesy of https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) to explain the process.

## create a small example matrix of "counts"
test_data <- matrix(c(1489,22,793,76,521,906,13,410,42,1196),nrow=5)
rownames(test_data) <- c("EF2A","ABCD1","MEFV","BAG1","MOV10")
colnames(test_data) <- c("SampleA","SampleB")
test_data
      SampleA SampleB
EF2A     1489     906
ABCD1      22      13
MEFV      793     410
BAG1       76      42
MOV10     521    1196

Firstly, an “average” or reference sample is created that represents the counts on a typical sample in the dataset. The geometric mean is used rather than the arithmetic mean. In other words the individual counts are multiplied rather than suHsed and the measure should be more robust to outliers.

psuedo_ref <- sqrt(rowProds(test_data))
psuedo_ref
[1] 1161.47923   16.91153  570.20172   56.49779  789.37697

A ratios of sample to “psuedo reference” are then calculated for each gene. We are assuming that most genes are not changing dramatically, so this ratio should be somewhere around 1.

test_data/psuedo_ref
        SampleA   SampleB
EF2A  1.2819859 0.7800398
ABCD1 1.3008873 0.7687061
MEFV  1.3907359 0.7190438
BAG1  1.3451854 0.7433919
MOV10 0.6600142 1.5151189

DESeq2 defines size factors as being the median of these ratios for each sample (median is used so any outlier genes will not affect the normalisation).

norm_factors <- colMedians(test_data/psuedo_ref)
norm_factors
[1] 1.3008873 0.7687061

Individual samples can then normalised by dividing the count for each gene by the corresponding normalization factor.

test_data[,1] / norm_factors[1]
      EF2A      ABCD1       MEFV       BAG1      MOV10 
1144.60340   16.91153  609.58395   58.42166  400.49589 

and for the second sample…

test_data[,2] / norm_factors[2]
      EF2A      ABCD1       MEFV       BAG1      MOV10 
1178.60387   16.91153  533.36378   54.63727 1555.86118 

The size factors for each sample in our dataset can be calculated using the estimateSizeFactorsForMatrix function.

sf <- estimateSizeFactorsForMatrix(assay(dds))
sf
 1_CTR_BC_2  2_TGF_BC_4   3_IR_BC_5  4_CTR_BC_6  5_TGF_BC_7 
  1.0549852   1.1812947   0.8928435   1.0990482   1.0518069 
 6_IR_BC_12 7_CTR_BC_13 8_TGF_BC_14  9_IR_BC_15 
  0.8626010   0.9491286   1.0643407   0.9783918 

The estimation of these factors can also take gene-lengths into account, and this is implemented in the estimateSizeFactors function. Extra normalization factor data is added to the dds object.

dds <- estimateSizeFactors(dds)
dds

In preparation for differential expression DESeq2 also need a reliable estimate of the variability of each gene; which it calls dispersion.

dds <- estimateDispersions(dds)
dds

A statistical test can then be applied. As the data are count-based and not normally-distributed a t-test would not be appropriate. Most tests are based on a Poisson or negative-binomial distribution; negative binomial in the case of DESeq2. Although you might not be familiar with the negative binomial, the results should be in a familiar form with fold-changes and p-values for each gene.

dds <- nbinomWaldTest(dds)

It may seem like there is a lot to remember, but fortunately there is one convenient function that will apply the three steps. The messages printed serve as reminders of the steps included.

---
title: "BMS31004 Bioinformatics for Biomedical Science - Week 5"
author: "Module Coordinator Mark Dunning"
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
editor_options: 
  chunk_output_type: inline
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,message=FALSE,warning=FALSE)
library(dplyr)

```



# Learning outcomes

- Obtaining differential expression statistics for different contrasts
- Using annotation databases to map between gene identifers
- Construction and interpretation of common visualisations for RNA-seq
    + scatter plots
    + volcano plots
    + MA-plots 
    + heatmaps

We can now have a list of genes ordered according to their evidence for being differentially-expressed. You should have saved a `de_treated.rds` object in the previous session.

```{r}
de_treated <- readRDS("Robjects/de_treated.rds")
dds <- readRDS("Robjects/dds.rds")
```

If you don't have it, you can download directly into RStudio using the code below:-

```{r eval=FALSE} 
dir.create("Robjects/")
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/de_treated.rds",destfile = "Robjects/de_treated.rds")
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/dds.rds",destfile = "Robjects/dds.rds")


de_treated <- readRDS("Robjects/de_treated.rds")
dds <- readRDS("Robjects/dds.rds")
```

```{r}
library(tidyverse)
library(DESeq2)
```


An overview of the results is to use the `plotMA` function. Each point on this plot represents and individual gene with the x- and y-axes being the overall expression level and magnitude of difference respectively. Significant genes are automatically highlighted, although at this stage we don't have a convenient way of labeling the genes. The fanning effect at low expression levels is often seen due to high relative fold-change at low expression levels. 

```{r}
plotMA(de_treated)
```

It is also instructive to perform a "sanity" check and plot the sample-level counts for genes with high significance. This could highlight any other technical factors that we are not currently taking into account. The plot is not particularly attractive, but is a good quick diagnostic. If you pick genes with a large fold-change (and/or low p-value) you can verify the "direction" of the contrast and which sample group was used as a baseline.

```{r}
results_treated <- results(de_treated, tidy=TRUE) %>% arrange(padj)
```


```{r}
## this gene has a positive fold-change
plotCounts(dds,"ENSG00000158258",intgroup = "Treated")
```

```{r}
## this gene has a negative fold-change
plotCounts(dds,"ENSG00000136999",intgroup = "Treated")
```

From the plots we can confirm that un-treated samples (annotated with `N` in the `Treatment` group) was used as a baseline; in other words a positive fold-change means a higher expression level in Treated samples. We will see shortly how to change the direction of the contrast.

<div class="information">
If your study involves knocking-out a particular gene, or you have some positive controls that are known in advance, it would be a good idea to visualise their expression level with `plotCounts`.
</div>

# Exercise

<div class="exercise">
- Re-run the `DESeq` function to find differentially-expressed genes between the `TGF` treated samples and `CTRL`. You will need to change the `design` that is used.
- Write a csv file that contains results for the genes that have a p-value less than 0.05 and a log2 fold change more than 1, or less than -1.
  - HINT: So that we don't overwrite our results so far, it may be convenient to create a new `DESeqDataSet` object for the new differential expression analysis. Check the `colData` to see which analyses can be made
- Use the `plotCounts` function to visually-inspect the most statistically-significant gene identified

```{r eval=FALSE}

dds_condition<- dds
colData(dds_condition)
## You will need to change this line to choose the correct comparison
design(dds_condition) <- ~...
```

</div>



# Changing the direction of the contrast

In this initial analysis `DESeq2` has automatically decided which member of our sample groups to use as our baseline (`CTR` in this case) so that the log2 fold changes are reported with a positive value meaning higher expression in Treated. If we want to change this behaviour we can change the `contrast` argument in the `results` function


```{r eval=FALSE}
## This should give the same as the table above
results(de_treated, contrast=c("Treated","Y","N"))
## Changing the direction of the contrast
results(de_treated, contrast=c("Treated","N","Y"))
```

# Analyses involving more than two groups

If we change to performing differential expression analysis on the `condition` variable then there are various contrasts that can be made; `IR` vs `CTR`, `TGF` vs `CTR` etc. When the `results` function is run, the table that is displayed is for the contrast `TGF vs CTR`. The `resultsNames` function can tell us which other contrasts we can access.

```{r}
dds_condition<- dds
dds_condition$condition <- as.factor(stringr::str_to_upper(dds_condition$condition))
design(dds_condition) <- ~condition
de_condition <- DESeq(dds_condition)
resultsNames(de_condition)
```
For a factor with more than 2 groups, the `contrast` argument can be used to change the output. Here, we compare the `IR` to `CTR` -treated groups.

```{r }
results(de_condition, contrast = c("condition","IR","CTR"),tidy=TRUE)  %>%
arrange(padj) 
```

If we wanted to change the analysis to `IR` versus `TGF` we don't have to re-run the analysis. The `results` function can be used to extract a different contrast.

```{r}
results(de_condition, contrast = c("condition","IR","TGF"),tidy=TRUE) %>%
arrange(padj) 
```
# Analysing a subset of the data

The `tximport` and `DESeq2` packages make it quite convenient to import a set of RNA-seq counts into R. However, we don't neccesarily have to use all the samples in our differential expression analysis.

For example, we might decide, based on the sample QC, that one of our samples should be removed as an outlier.

The `dds` object can be treated with the usual subsetting conventions as a count matrix (even though it is a more complex data type). Specifically we can define a set of column indices to subset the samples.

e.g. this will return a `DESeqDataset` using the first four samples from `dds`.

```{r}
dds[,1:4]
```


```{r eval=FALSE}
samples_to_keep <- c("1_CTR_BC_2","2_TGF_BC_4","3_IR_BC_5",
                     "4_CTR_BC_6","5_TGF_BC_7", "6_IR_BC_12",
                     "7_CTR_BC_13","8_TGF_BC_14","9_IR_BC_15")
dds_final <- dds[,dds$Run %in% samples_to_keep]
## Proceed as usual
```


```{r eval=FALSE}
samples_to_remove <- "7_CTR_BC_13"
dds_final <- dds[,!dds$Run %in% samples_to_remove]

```

Or based on some criteria such as the number of reads

```{r}
number_of_reads <- colSums(counts(dds)) / 1e6
dds_final <- dds[,number_of_reads > 35]
```
Finally, we might decide to analyse a specific biological group.

```{r}
dds_treated <- dds[,dds$Treated == "Y"]
```

# More-complex designs

The examples we have used so far have performed a differential expression analysis using a named column in the `colData` object. The `DESeq2` package is capable of performing more complex analyses that can take multiple factors into consideration at the same time; so-called "multi-factor designs"

- [Multi-factor designs in DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs)

The use of such a design could be motivated by discovering sources of technical variation in our data that might obscure the biological differences we would like to compare. e.g.

![](images/batch_effect.png)

In the example image above the main source of variation is the batch in which the samples were sequenced. A multi-factor analysis to compare the various conditions, but "correct" for differences in batch, would be as follows.

```{r eval=FALSE}
### Don't run this. It's just a code example
design(MY_DATA) <- ~ batch + condition
```

Likewise, if we have different treatments applied to difference cell-lines, but the main source of variation is the cell line (as in Weekly Exercise 4), the following could be used.

```{r eval=FALSE}
### Don't run this. It's just a code example
design(MY_DATA) <- ~cell_line + treatment
```

# Adding annotation to the DESeq2 results

We would love to share this list with our collaborators, or search for our favourite gene in the results. However, the results are not very useful in there current form as each row is named according to an *Ensembl* identifier. Whilst gene symbols are problematic and can change over time, they are the names that are most recognisable and make the results easier to navigate.



There are a number of ways to add annotation, but we will demonstrate how to do this using the *org.Hs.eg.db* package. This package is one of several *organism-level* packages in Bioconductor that are re-built every 6 months. These packages are listed on the [annotation section](http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) of the Bioconductor, and are installed in the same way as regular Bioconductor packages. An alternative approach is to use `biomaRt`, an interface to the [BioMart](http://www.biomart.org/) resource. BioMart is much more comprehensive, but the organism packages fit better into the Bioconductor workflow.


```{r eval=FALSE}
### Only execute when you need to install the package
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
# For Human
BiocManager::install("org.Hs.eg.db")
```

The packages are larger in size that Bioconductor software pacakges, but essentially they are databases that can be used to make *offline* queries. 

```{r message=FALSE}
library(org.Hs.eg.db)
```


<div class="information">
The following error may be seen loading this library on some versions of RStudio. 

```
Show in New WindowClear OutputExpand/Collapse Output
Error: package or namespace load failed for ‘org.Hs.eg.db’: .onLoad failed in loadNamespace() for 'org.Hs.eg.db', details: call: l$contains error: $ operator is invalid for atomic vectors
```

A solution is presented in this [forum post](https://community.rstudio.com/t/problem-in-loading-namespace-of-org-hs-eg-db-package-in-rstudio-server-error-operator-is-invalid-for-atomic-vectors/101789)


```{r eval=FALSE}
options(connectionObserver = NULL)
library(org.Hs.eg.db)
```


</div>


First we need to decide what information we want. In order to see what we can extract we can run the `columns` function on the annotation database.

```{r}
columns(org.Hs.eg.db)
```

We are going to filter the database by a key or set of keys in order to extract the information we want. Valid names for the key can be retrieved with the `keytypes` function.

```{r}
keytypes(org.Hs.eg.db)
```

We should see `ENSEMBL`, which is the type of key we are going to use in this case. If we are unsure what values are acceptable for the key, we can check what keys are valid with `keys`

```{r}
keys(org.Hs.eg.db, keytype="ENSEMBL")[1:10]
```



For the top gene in our analysis the call to the function would be:-

```{r eval=FALSE}
select(org.Hs.eg.db, keys="ENSG00000158258",
       keytype = "ENSEMBL",columns=c("SYMBOL","GENENAME")
)
```

Unfortunately, the authors of `dplyr` and `AnnotationDbi` have both decided to use the name `select` in their packages. To avoid confusion and errors, the following code is sometimes used to explictly tell RStudio to use the `select` function from `AnnotationDbi`.

```{r}
AnnotationDbi::select(org.Hs.eg.db, keys="ENSG00000158258",keytype = "ENSEMBL",columns=c("SYMBOL","GENENAME"))
```


To annotate our results, we definitely want gene symbols and perhaps the full gene name. Let's build up our annotation information into a new data frame using the `select` function. The `keys` argument to `select` is a *vector* of valid identifiers. The names of all Ensembl IDs used in the analysis can be found as the `rownames` of our `DESeq2` object. 

```{r}
## Get the names of all genes used in the analysis
all_ids <- rownames(dds)
anno <- AnnotationDbi::select(org.Hs.eg.db,keys=all_ids,
              columns=c("SYMBOL","GENENAME"),
              keytype="ENSEMBL")
# Have a look at the annotation
head(anno)
```

However, we have a problem that the resulting data frame has more rows than our results table. This is due to the *one-to-many* relationships that often occur when mapping between various identifiers.

```{r}
dim(anno)
dim(results_treated)
```

Such duplicated entries can be identified using the `duplicated` function. This will be either `TRUE` or `FALSE` if a particular entry is repeated again in a column. Since we are interested in the entries that are *not* repeated we an use the R syntax `!` which will reverse a set of logical values.

```{r}
df <- data.frame(x = 1:10, y = c(LETTERS[1:6],LETTERS[1:4]))
df
```

```{r}
filter(df, duplicated(y))
```

```{r}
filter(df, !duplicated(y))
```


Fortunately, there are not too many duplicated entries so hopefully we won't lose too much information if we discard the entries that are duplicates The first occurrence of the duplicated ID will still be included in the table.

```{r}
anno <- AnnotationDbi::select(org.Hs.eg.db,keys=all_ids,
            columns=c("ENSEMBL","SYMBOL","GENENAME","ENTREZID"),
            keytype="ENSEMBL") %>% 
filter(!duplicated(ENSEMBL))
dim(anno)
```


We can now join the annotation information to the `results` data frame. This can be done because both data frame contains the same set of identifiers. However, the column names are different so need to tell R which columns contain the same data (`row` and `ENSEMBL` respectively).

```{r}
results_annotated <- left_join(results_treated, anno,by=c("row"="ENSEMBL")) 
head(results_annotated)
```


We can save the results table using the `write.csv` function, which writes the results out to a csv file that you can open in excel.

```{r}
dir.create("de_analysis",showWarnings = FALSE)
write.csv(results_annotated,file="out_data/DESeq_treatment_Y_vs_N_annotated.csv",row.names=FALSE)
```


# Exercise

<div class="exercise">

- Join the annotation table to your results from the DESeq analysis of `TGF` vs `CTR`. Save the resulting data frame as a csv file. e.g. (`out_data/DESeq_condition_TGF_vs_CTR_annotated.csv`)
- The publication gives examples of `COL1A1`, `COL1A2` and `COL3A1` as genes that are *up-regulated* in TGF-treated samples vs controls (Figure 6C). Use your data to verify this by 
  + i) plotting the counts for these genes 
  + ii) extracting their p-values


</div>

```{r}

```



# Visualisations incorporating gene names

## The volcano plot

Similar to the MA-plot we saw earlier, the volcano plot is often used to display the results of a differential expression analysis. The x-axis shows the log$_2$ fold-change and the y-axis is a measure of significance - typically a -log$_{10}$ transformed version of the p-value.

![](https://upload.wikimedia.org/wikipedia/commons/7/74/Volcano_eg.jpg)

We would not be able to visualise the p-values without any transformation (they would be squashed on the bottom of the y-axis). By using a log$_{10}$ transformation we get quantities that are easier to visualise. Applying a minus sign to the result places more significant p-values at the top of the plot.

```{r}
p <- c(0.05,0.01,0.0001)
log10(p)
-log10(p)
```


All the data we need for the plot is included in our annotated results table.

```{r}
## We manually set the x-limit in this case
ggplot(results_annotated, aes(x = log2FoldChange, y = -log10(padj))) + geom_point() + xlim(-5,5)
```

To make the plot more useful we can label individual points. The general way of adding of adding labels is by using `geom_text` in conjunction with the `label` aesthetic. However, it is not very sensible to blindly add the `geom_text` function. We use the `SYMBOL` column from the results table as these are recognisable gene names.

```{r}
ggplot(results_annotated,aes(x = log2FoldChange, y = -log10(padj),label=SYMBOL)) + geom_point() + geom_text()
```

The problem is that `ggplot` tries to label every gene that was tested in the analysis; which is clearly not very sensible.

Let's say that we only want to label the top 20 most significant genes. The label we would want to give these genes is the gene name (`SYMBOL`), whereas the reminder of the genes should have no label. We can assign a set of labels based on where the genes are ranked in the gene list by using an `ifelse` statement in R.

With `ifelse` we can assign labels based on a logical condition. If the condition is `TRUE` then one label is assigned; if `FALSE` an alternative is assigned. In our case we want the label to be the gene symbol if the gene ranks less than 20 in. If not, the label is the empty string.

```{r}
mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, ""))
```
We can now re-run the plot but using the newly created labels

```{r}
mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, "")) %>% 
## notice how we set the label argument to be Label rather than SYMBOL
ggplot(aes(x = log2FoldChange, y = -log10(padj),label=Label)) + geom_point() + geom_text(col="blue")
```

A nicer way of displaying the names is provided by the `ggrepel` package

```{r}
if(!require(ggrepel)) install.packages("ggrepel")
```

```{r}
library(ggrepel)
mutate(results_annotated, Rank = 1:n()) %>% 
  mutate(Label = ifelse(Rank <= 20, SYMBOL, "")) %>% 
ggplot(aes(x = log2FoldChange, y = -log10(padj),label=Label)) + geom_point() + geom_text_repel(col="blue")
```


## Making a heatmap from the differential expression results

A heatmap is a very popular method for visualising gene expression dataset. It shows the expression levels of a set of genes, and how those genes separate the samples in the dataset into groups. Like the PCA plot we used previously, it is an unsupervised method whereby known biological groups are added after the plot has been created.

However, we don't use the whole expression matrix to construct the heatmap; attempting to do this will quickly cause R to crash. Moreoever, the plot wouldn't be particularly informative as most genes are not expected to change between biological conditions, or won't even be expressed.

We often use heatmaps to show a very small set of genes. The most statistically-significant genes are a popular choice as they should discriminate well between biological groups.

The first step is to identify the *Ensembl* IDs of the most significant genes

```{r}
de_ids <- results_annotated %>% 
  arrange(padj) %>% 
  dplyr::slice(1:10) %>% 
  pull("row")
de_ids
```
The heatmap tool we are going to use requires a count matrix as input, and our count matrices have Ensembl IDs as row identifiers. As with the boxplots and PCA plots from before this visualisation works best for transformed counts.

```{r}
library(pheatmap)
vsd <- vst(dds)
heatmap_data <- assay(vsd)[de_ids,]
pheatmap(heatmap_data)
```

A *dendrogram* is used to show the relationship between samples; whereby samples that have a short "branch" joining them being similar to each other. However, samples being next to each other on the plot **does not** imply that they are closely related.

The code to generate the heatmap can be modified to include useful sample annotations. The meta data for this dataset is stored in `dds` but must undergone a bit of modification before we can include it.
 
```{r}
## the meta data must be a data frame
samp_anno <- data.frame(colData(dds))
## the rownames must match the columns of the data being plotted
rownames(samp_anno) <- dds$Run
pheatmap(heatmap_data,
         annotation_col = samp_anno)
```

We can do some tidying-up as not all the columns in our annotation data frame need to be displayed

```{r}
samp_anno <- dplyr::select(samp_anno, Treated, Replicate) %>% 
  mutate(Replicate = as.factor(Replicate))
pheatmap(heatmap_data,
         annotation_col = samp_anno)
```

<div class="exercise">
**Exercise**: 

- Make another heatmap of the dataset, but this time using the 30 genes with the largest absolute fold-change
- The cells in the heatmap are often *scaled* at a gene-level according to the change from mean rather than absolute value. Look at the help for `pheatmap` to see how to change the heatmap
- It would be more informative to have the rows labeled according to the `SYMBOL` rather than Ensembl ID. Look at the help for `pheatmap`; which argument do you need to change? Change the heatmap so that `SYMBOL`s are shown for each row.

</exercise>

# APPENDIX: Full DESeq workflow

The median of ratios normalisation method is employed in DESeq2 to account for *sequencing depth* and *RNA composition*. Let's go through a short worked example (courtesy of [https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html](here)) to explain the process.

```{r}
## create a small example matrix of "counts"
test_data <- matrix(c(1489,22,793,76,521,906,13,410,42,1196),nrow=5)
rownames(test_data) <- c("EF2A","ABCD1","MEFV","BAG1","MOV10")
colnames(test_data) <- c("SampleA","SampleB")
test_data
```

Firstly, an "average" or reference sample is created that represents the counts on a typical sample in the dataset. The *geometric mean* is used rather than the *arithmetic* mean. In other words the individual counts are multiplied rather than suHsed and the measure should be more robust to outliers.


```{r}
psuedo_ref <- sqrt(rowProds(test_data))
psuedo_ref
```

A ratios of sample to "psuedo reference" are then calculated for each gene. We are assuming that most genes are not changing dramatically, so this ratio should be somewhere around 1.

```{r}
test_data/psuedo_ref
```

`DESeq2` defines size factors as being the *median* of these ratios for each sample (median is used so any outlier genes will not affect the normalisation). 

```{r}
norm_factors <- colMedians(test_data/psuedo_ref)
norm_factors
```

Individual samples can then normalised by dividing the count for each gene by the corresponding normalization factor.

```{r}
test_data[,1] / norm_factors[1]
```

and for the second sample...

```{r}
test_data[,2] / norm_factors[2]

```

The size factors for each sample in our dataset can be calculated using the `estimateSizeFactorsForMatrix` function.

```{r}
sf <- estimateSizeFactorsForMatrix(assay(dds))
sf
```

The estimation of these factors can also take gene-lengths into account, and this is implemented in the `estimateSizeFactors` function. Extra normalization factor data is added to the `dds` object.



```{r eval=FALSE}
dds <- estimateSizeFactors(dds)
dds
```

In preparation for differential expression DESeq2 also need a reliable estimate of the variability of each gene; which it calls *dispersion*. 

```{r eval=FALSE}
dds <- estimateDispersions(dds)
dds

```

A statistical test can then be applied. As the data are count-based and not normally-distributed a t-test would not be appropriate. Most tests are based on a *Poisson* or *negative-binomial* distribution; negative binomial in the case of `DESeq2`. Although you might not be familiar with the negative binomial, the results should be in a familiar form with fold-changes and p-values for each gene.

```{r eval=FALSE}
dds <- nbinomWaldTest(dds)
```

It may seem like there is a lot to remember, but fortunately there is one convenient function that will apply the three steps. The messages printed serve as reminders of the steps included.